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1.  Introduction 


In  this  effort,  we  consider  the  problem  of  learning  low-dimensional  representations 
of  objects  from  a  spatially  distributed  network  of  sensors.  Such  a  sensor  network 
can  be  used  to  construct  rich  appearance  models  for  objects  in  their  common  field 
of  view.1  These  models  can  then  be  used  in  a  variety  of  ways,  for  example,  to 
identify  previously  seen  objects  if  they  reappear  in  the  network  at  a  later  time,  or 
distill  important  common  or  discriminating  characteristics  associated  with  different 
objects. 

As  an  example,  consider  a  network  of  cameras  capturing  images  of  an  object  from 
different  but  possibly  overlapping  aspects,  say,  as  the  object  traverses  through  the 
network’s  field  of  view.  The  ensemble  of  images  captured  by  the  network  may  be 
well  modeled  by  a  low-dimensional  nonlinear  manifold  in  the  high-dimensional 
ambient  space  of  images.  One  approach  to  estimating  this  appearance  model  might 
be  to  learn  independent  models  of  a  local  object  manifold  at  each  sensor  node,  and 
then  later  share  these  models  across  the  network.  Such  an  approach  is  likely  to 
produce  models  with  high  uncertainty  or  even  gaps  if  a  given  sensor  node  observes 
the  object  for  only  a  limited  set  of  aspects.  An  alternative  approach,  and  the  one  we 
pursue  in  this  report,  is  to  construct  a  single  joint  model  for  the  image  ensemble 
across  the  network.  The  parameter  estimates  of  the  joint  model  will  improve  with 
the  number  of  sensor  nodes,2  since  the  number  of  unknown  parameters  in  the  model 
is  intrinsic  to  the  object  and  fixed,  whereas  the  measurements  scale  linearly  with  the 
number  of  sensor  nodes. 

We  model  the  overall  statistics  of  the  observations,  as  seen  across  multiple  aspects 
and  multiple  sensors,  as  a  mixture  of  factor  analyzers  (MFA)3  and  derive  a  cen¬ 
tralized  gradient-based  algorithm  for  learning  model  parameters.  The  MFA  model 
is  both  probabilistic  and  generative,  and  can  be  used  for  dimensionality  reduction, 
manifold  learning,  and  signal  recovery  from  compressed  sensing.1  In  the  case  of 
learning  a  data  manifold,  the  MFA  model  is  a  linearization  of  a  potentially  nonlin¬ 
ear  structure.  Each  MFA  factor  mean  relates  to  a  point  on  the  manifold,  while  the 
tangent  plane  of  the  manifold  at  that  point  is  spanned  by  the  columns  of  that  factor’s 
loading  matrix. 

Factor  analysis  has  been  used  successfully  in  many  problems.1-4  Ghahramani  and 
Hinton3  applied  the  expectation-maximization  (EM)  algorithm  to  parameter  esti- 
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mation  for  the  MFA  model.  The  EM  algorithm5  is  a  popular  iterative  method  for 
maximum-likelihood  (ML)  estimation  with  local  convergence  properties  and  a  sim¬ 
ple  implementation  for  many  applications  in  statistical  signal  processing.  However, 
in  many  practical  scenarios,  it  can  exhibit  slow  convergence,6  which  leads  to  re¬ 
search  in  acceleration  methods,  notably  hybrid  methods 7  that  complement  EM  with 
information  from  the  problem’s  likelihood  and  its  gradient.  In  this  report,  we  derive 
a  constrained  Fisher  scoring  method  that  can  be  viewed  as  a  hybrid  approach  since 
a  subset  of  the  parameter  updates  have  an  equivalence  with  the  EM  algorithm  for 
the  MFA  model. 

The  relationship  between  the  method  of  scoring  and  the  EM  algorithm  for  the  ex¬ 
ponential  family  of  distributions  was  first  described  by  Titterington.8  The  specific 
relationship  for  a  Gaussian  mixture  model  (GMM)  was  later  formulated  by  Xu  and 
Jordan,9  where  they  showed  EM  steps  can  be  related  to  the  score  function  in  the 
parameter  space  through  a  positive  definite  scaling  matrix.  In  Xu  and  Jordan,9  the 
scaling  matrix  was  teased  out  from  the  EM  update  equations.  Alternatively,  one 
can  show  that  this  matrix  is  the  expected  Fisher  information  matrix  (FIM)  of  the 
complete  data  model  for  a  GMM.  In  this  case,  the  scoring  method  provides  a  more 
procedural  approach  for  determining  the  gradient- scaling  matrix  and  enables  con¬ 
vergence  analysis  using  standard  tools  from  optimization  theory. 

The  contributions  of  this  report  are  as  follows. 

First,  we  develop  a  computationally  attractive  method  of  scoring  for  estimating  the 
parameters  of  a  MFA  model  from  measurements  collected  by  a  spatially  distributed 
sensor  network.  The  algorithm  is  a  centralized  approach  in  that  it  assumes  that  the 
sensed  data  can  be  accumulated  at  a  single,  central  point  for  joint  data  processing. 
The  proposed  scoring  algorithm  is  derived  using  the  complete  data  FIM,  as  opposed 
to  the  incomplete  data  FIM.  The  complete  data  FIM  has  a  block-diagonal  structure, 
leading  to  significantly  reduced  computations  compared  to  Newton’s  method.  Fur¬ 
thermore,  the  FIM  exposes  a  low-rank  structure  that  permits  further  reductions  in 
computations.  The  scoring  method  is  shown  to  have  faster  convergence  than  the  EM 
algorithm  for  a  MFA,3  sometimes  significantly  faster,  while  retaining  comparable 
computational  complexity. 

Second,  we  demonstrate  the  efficacy  of  the  constrained  scoring  approach  for  ef¬ 
ficiently  federating  across  the  entire  sensor  network  a  global  appearance  model 
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of  objects,  even  if  each  sensor  observes  only  a  very  limited  subset  of  the  entire 
model  appearance.  Thus,  the  approach  presented  in  this  report  is  an  efficient  method 
for  learning  a  global  model.  Previously  developed  methods10  provided  learning  of 
aspect-independent  object  signatures;  in  this  work,  we  develop  a  method  for  learn¬ 
ing  appearance  models  with  aspect  dependence. 

The  remainder  of  this  report  is  outlined  as  follows.  In  Section  2,  we  outline  the  MFA 
observation  model.  In  Section  3,  we  derive  the  update  equations  of  the  centralized 
scoring  method  for  the  MFA  model.  In  Section  4,  numerical  examples  demonstrate 
the  improved  performance  of  the  centralized  algorithm  over  EM.  Finally,  conclu¬ 
sions  are  given  in  Section  5. 


2.  Mixture  of  Factor  Analyzers 

Consider  a  set  of  M  spatially  distributed  sensors  that  observe  a  scene.  Sensor  nodes 
collect  N  vector  observations  each,  denoted  xmi  6  Rp,  for  m  =  1,2 , . . . ,  M  and 
i  =  1,2 , ,N.  The  observations  are  statistically  independent,  but  not  identically 
distributed  across  the  sensor  network.  Each  observation  from  the  mth  sensor  node 
is  modeled  as  a  mixture  of  Gaussians,  with  likelihood  given  by 

K 

^  ^  &mk’Af(xi  Efc) ,  (1) 

k=  1 


where 


Af(x-,  /x,  E) 


1 

V(27r)pls 


exp 


/i)tE  1(a? 


The  covariance  takes  on  a  low-rank  structure  according  to 


(2) 


Efc  'ipk^p  T  AfcAk, 


(3) 


where  ipk  >  0  and  Ak  e  Rpxr,  with  r  <  p.  The  matrix  Ak  is  called  the  factor  loading 
matrix  and  is  called  the  uniqueness .n  The  factor  loading  matrix  characterizes 
the  lower-dimensional  latent  space  and  the  uniquenesses  account  for  observation 
noise  and  imperfect  modeling  as  a  low-rank  structure.  Each  ornk  represents  the 
mixing  proportion  of  factor  k  that  sensor  m  observes.  Each  sensor  node  observes 
the  common  factors  with  potentially  differing  proportions. 
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The  vector  G  consists  of  the  parameters  am  =  [aml, . . . ,  amK]T  for  m  —  1,2 , ,M 
and  $,k  =  [fij,  X 1,  7pk] T  for  k  =  1, . . . ,  K  where  A k  =  vec  (Afc).  We  specify  the 
vector  of  unknown  parameters  as 


0=  [£,&■■■,* 


>«Af 


(4) 


The  log-likelihood  function  of  6  given  observation  x  at  sensor  node  m  is  given  by 


£m{6;x)  =  log pm(x;  9). 


(5) 


Given  N  independent  observations  from  each  of  the  M  sensors,  the  data  log-likelihood 
function  is  then 


M  N 

£(0;X)  =  EE  £m(G,Xmi)j  (6) 

m=  1  i=  1 

where  X  represents  the  collection  of  NM  observations  from  across  the  sensor  net¬ 
work. 

To  simplify  the  notation,  it  is  assumed  the  sensor  nodes  all  collect  the  same  number 
of  observations  N .  However,  the  model  and  the  proposed  algorithms  that  follow  can 
be  readily  modified  to  accommodate  differing  numbers  of  observations. 

It  is  common  to  treat  the  MFA  model  as  having  latent  variables,  variables  that  are 
not  directly  observed.  The  MFA  model  can  be  interpreted  as  having  both  continuous 
and  discrete  latent  variables  (e.g.,  see  McLachlan  and  Peel12).  We  consider  only  the 
discrete  latent  terms.  We  denote  by  z  €  {1,  2 , ,K}  the  index  to  which  factor 
generated  observation  x.  Instead  of  modeling  continuous  latent  terms  to  describe 
the  factors,  the  low-rank  structure  in  Eq.  3  is  favored  as  a  parametric  model  of  each 
factor’s  covariance.  The  joint  distribution  of  the  complete  data  pair  {x,  z}  at  sensor 
node  77i  is  given  by 

Pm(x,  z;  6)  =  pm(z ;  0)p{x\z\  0)  =  amzJV(x ;  /j,z,  E*).  (7) 

The  complete  data  log-likelihood  of  G,  given  the  MN  independent  observation 
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pairs,  is  given  by 


M  N 

m  x,z)  =  Y.T.  Mw  £*J- 

m=  1  2—1 


(8) 


3.  Constrained  Fisher  Scoring 

It  is  well  known  that  no  closed-form  solution  exists  for  ML  estimates  of  the  param¬ 
eters  of  an  MFA.  Instead,  we  must  rely  on  iterative  algorithms,  such  as  Newton’s 
method  or  EM.  While  Newton’s  method  may  converge  to  a  solution  quickly,  the 
computational  complexity  tends  to  be  impractical.  On  the  other  hand,  the  computa¬ 
tional  complexity  of  EM  per  iteration  tends  to  be  favorable,  but  its  convergence  rate 
can  be  less  favorable.  As  a  balance  between  Newton’s  method  and  EM,  we  consider 
Fisher’s  method  of  scoring.  Similar  to  EM  but  unlike  Newton’s  method,  the  scor¬ 
ing  method  requires  only  first-order  derivatives.  Though  like  Newton’s  method,  the 
performance  of  the  scoring  method,  in  terms  of  convergence  and  convergence  rate, 
can  be  analyzed  using  standard  techniques  from  optimization  theory. 

As  an  unconstrained  problem,  parameters  of  the  MFA  are  unidentifiable.11  There¬ 
fore,  we  must  impose  constraints.  One  immediate  constraint  is  the  mixing  propor¬ 
tions  be  proper  probabilities  (i.e.,  arnk  >  0  and  arnk  =  1).  Additionally,  the 
uniquenesses  must  be  positive  and  the  factor  loading  matrix  must  be  restricted  to 
achieve  identifiability.  It  is  easy  to  show  that  there  are  an  infinity  of  solutions  for 
each  factor  loading  matrix  A  with  equivalent  product  AAT.  There  are  many  ways  to 
constrain  the  factor  loading  matrix  to  ensure  identifiability.13  One  particularly  use¬ 
ful  condition  is  that  the  factor  loading  matrix  be  a  lower-triangular  matrix,  which 
removes  the  indeterminacy  due  to  matrix  rotations.13 

We  denote  by  f(9)  the  vector  of  constraints  on  the  unknown  parameters  such  that 
f(9)  =  0.  The  constrained  ML  problem  is  given  by 

ma x£(9]X)  s.t.  f(9)=  0.  (9) 

0 

The  maximizer  of  Eq.  9  can  be  solved  iteratively  via  the  constrained  scoring  method.14 


Approved  for  public  release;  distribution  is  unlimited. 


5 


The  constrained  scoring  method  is  an  iterative,  gradient-based  algorithm  given  by 


e«+i>  =  em  +  (ue  (ulJeue)-1  ulve((e-.X) 

where  Jg  is  the  (unconstrained)  expected  FIM,  defined  by 


(10) 


Je  =  E(V^(0;X)V^(0;X)),  (11) 

matrix  Ug  is  defined  by  the  constraints  on  6,  and  V g£(0;  X )  is  the  score  function 
of  the  (incomplete)  data  log-likelihood.  In  Moore  et  al.14  and  Stoica  and  Ng,15  the 
matrix  Ug  is  required  to  be  orthonormal  (i.e.,  UgU0  =  I).  However,  as  shown  in 
Appendix  A,  the  orthonormality  requirement  is  unnecessary. 

Instead  of  using  the  FIM  in  Eq.  11,  we  propose  substituting  the  complete  data  in¬ 
formation  matrix.  The  complete  data  FIM  is  defined  according  to 


Jce  =  E  (V0£c(0;  X,  Z)\7Telc(0\  X,  Z))  .  (12) 


This  substitution  was  previously  proposed  for  recursive  estimation  with  incomplete 
data.8  Though  not  proven  directly,  Titterington  states  equivalence  between  batch¬ 
mode  EM  and  (unconstrained)  Fisher  scoring  when  substituting  the  complete  data 
FIM  and  when  the  complete  data  can  be  expressed  as  an  exponential  family  dis¬ 
tribution.8  We  do  not  arrive  at  identical  update  equations  between  unconstrained 
scoring  and  EM.  Instead,  we  find  equivalence  for  a  subset  of  the  equations  between 
EM  and  constrained  scoring.  In  particular,  the  iterates  for  the  mixing  proportions 
are  identical  between  EM  and  constrained  scoring.  The  iterates  for  the  each  factor’s 
means  are  equivalent  in  an  asymptotic  sense, 


In  Eqs.  1 1  and  12,  the  expectations  are  with  respect  to  X  and  (X.  Z),  respectively. 
For  data  set  X,  the  factor  indicators  are  not  observed  directly.  In  this  view,  the  FIM 
in  Eq.  11  is  of  the  incomplete  data,  whereas  Eq.  12  is  of  the  complete  data.  For 
the  remainder  of  this  report,  references  to  the  FIM  imply  the  expected  information 
matrix  of  the  complete  data  in  Eq.  12.  The  proposed  scoring  method  is  then  given 
by 


Q(t+ 1) 


0W+  (UgiUlrgUgY'U^g^O'X) 


0=0 (t> 


(13) 
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For  the  MFA  model,  the  FIM  Jg  has  a  closed-form  expression  and  can  be  de¬ 
termined  through  a  change  of  variables  from  the  standard  GMM.  Define  $  = 
[nj,  a}, . . . ,  <rj0  ,  aXf]T,  where  <rk  =  vec  (£fc).  The  information  ma¬ 

trix  of  the  MFA  model  is  given  by 

Jce  =  GTgJc*Ge,  (14) 


where  J $  is  the  complete  data  FIM  of  the  GMM  defined  by 

4  =  E  (V*fc($;  X,  Z)V^C($;  X,  Z))  ,  (15) 


and  Gg  is  the  Jacobian  matrix  for  the  change  of  variables  defined  by 


Ge  = 


<9$ 

w 


(16) 


We  admit  a  slight  abuse  of  notation  in  Eq.  15  by  substituting  $  for  6  in  tc  from 
Eq.  8.  This  is  for  notational  convenience  intended  to  simply  imply  the  complete 
data  log-likelihood  function  of  the  GMM  without  the  low-rank  structure  in  Eq.  3 
imposed  on  the  covariance. 


The  details  of  J'j,  and  Go  are  provided  in  Appendix  B,  where  J&,  the  FIM  of  the 
GMM,  is  shown  to  be  block  diagonal.  It  is  clear  from  the  definitions  of  6  and  $ 
that  a  change  of  variables  is  necessary  only  for  each  factor’s  covariance.  Thus,  the 
FIM  of  the  MFA  has  the  block-diagonal  form 


JS  =  diag  (jM_,  ....  J„A„  (GlKJ„KG„K),  J„„ . . . ,  J„„)  . 


(17) 


This  represents  the  unconstrained  information  matrix  of  the  complete  data  model. 
However,  the  MFA  is  overparameterized  and  not  identifiable.1113  While  the  factor 
means  may  remain  unconstrained,  the  factor  loading  matrices,  the  uniquenesses, 
and  the  mixing  proportions  should  be  constrained.  The  details  of  these  constraints 
are  provided  in  the  following  subsections.  First,  we  remark  that  the  constraints  are 
decoupled  across  the  parameters  such  that  the  constrained  FIM  for  the  MFA  takes 
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on  the  block-diagonal  form  given  by 


ulereue  = 


diag  ( JMl,  C£,  JCT1 GCT1  C/CTl ),  JMo,  ( Ul^J^G^U \ 


M2’ 


r2  CT2  CT2  cr2  cr2 1  ’ 


•V,  (VlGlwJ„G„U„\  (Ul,  Ja,Ua,), ....  (t/7  J„ 


As  a  result  of  this  form,  the  parameter  iterates  of  a  factor  are  decoupled  from  those 
of  other  factors.  Also,  for  a  given  factor,  the  iterates  of  the  means  and  mixing  prob¬ 
ability  decouple.  In  contrast,  the  parameters  of  the  structured  covariance  remain 
coupled.  Subsequently,  the  constrained  scoring  method  in  Eq.  13  factors  into  a  set 
of  update  equations  given  by 


cl 


(*+1)  =  ryG) 
m  ^ 

(*+ 1) 


+  (Ua  (Ul  Ja  Ua  )  'Ul  Va  t(0;X) 


0=9<-t)  ’ 


0=0(t)  ’ 


'  Af+1) ' 

"  A?  ' 

.  4f+1) . 

1 

_ i 

+  (  U„  (UlGl  J_  G_  U„  )  1  Ul  Gl  V„.  e(&;  X ) 

\  &k  V  CT/c  CT/c  CT/c  &k/  &k  & k  &k  '  ' 


(19) 

(20) 

e=ew  ’ 

(21) 


where  cr[9  is  evaluated  by  substituting  (Aj.,),  in  Eq.  3.  Though  they  appear 
fairly  complex,  the  update  equations  significantly  simplify.  The  details  of  the  update 
equations  and  their  simplification  are  provided  in  the  following  subsections. 

3.1  Mixing  Proportions 

In  this  section,  we  derive  an  explicit  expression  for  iterates  of  the  mixing  propor¬ 
tions  that  is  simplified  compared  to  Eq.  19  and  achieves  the  desired  probability  con¬ 
straints.  The  iterates  of  each  mixing  probability  basically  reduce  to  sample  averages 
of  posterior  probabilities  of  an  observation  belonging  to  a  corresponding  factor. 

We  briefly  drop  the  dependence  on  the  specific  sensor  node  m  to  simplify  the  nota¬ 
tion  and  note  that  the  following  results  are  identical  for  each  m  —  1, . . . ,  M .  We  de¬ 
fine  gl  —  ax}T  as  the  vector  of  mixing  proportions  and  a  =  [on, . . . ,  oik-i}1 

as  the  reduced  parameter  vector. 

The  probability  constraint  can  be  applied  through  a  change  of  variables  since  we 
have  a  parametric  form  of  cl  with  respect  to  the  reduced  parameter  vector  cl ,16 
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From  the  sum-to-one  condition,  we  have  olk  =  1  —  lTd,  and  for  the  change  of 
variables  we  have 


Un  = 


d  dT,  1  —  lTd 
daT 


I/c-i 

-1T 


(22) 


From  Eq.  B-8  in  Appendix  B,  the  term  Ja  for  the  mixing  proportions  from  the  FIM 
is  given  by 


Ja  =  diag  (a)  1 


(23) 


From  Eqs.  22  and  23,  the  inverse  of  matrix  U^JaUa  evaluates  to 


KJaua) 


-i 


i 

N 

1 

N 


(diag  (d)  1  +  ax1llT) 

(diag  (d)  —  diag  (a)  1  (aK  +  lTdiag  (a)  l)  1  lTdiag  (d) 


1 

N 


(diag  (d) 


—  ddT)  . 


(24) 

(25) 


Equation  24  follows  from  application  of  the  Woodbury  matrix  identity  and  Eq.  25 
follows  since  aK  +  lTdiag  (d)  1  =  Y^k=\  °K  =  L  Lastly,  it  is  straightforward  to 
show  that 


Uc  KJaua)-'  Ul  =  4 


diag  (d)  —  dd1 

■  T 


—  (XkOL 


—  Q!Ka 


aK  otK 


=  (diag  (a) 


OtOt 


r)' 


(26) 


In  Xu  and  Jordan,9  it  was  shown  that  Eq.  26  is  positive  definite  provided  at  is  con¬ 
strained  to  a  probability  simplex.  Furthermore,  it  is  shown  by  Xu  and  Jordan9  that 
the  iteration  in  Eq.  19  with  the  scaling  matrix  in  Eq.  26  simplifies  to 


a 


6+i) 

mk 


l 

N 


V  w{t) 

/  j  w mik 


i=  1 


(27) 


where 


W mik 


QLrnk'N’ IJ'k')  ^k) 

Ef=i  amjN{x  mii 
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for  k  =  1 ,K,  m  =  1, . . . , M,  and  w^ik  is  Eq.  28  evaluated  at  6  =  6{t\ 
From  Eq.  28,  it  is  easy  to  see  that  Eq.  27  and  therefore  Eq.  19  meet  the  probability 
constraints.  Subsequently,  Eq.  26  is  positive  definite  at  6  =  0U' .  Thus,  the  iteration 
in  Eq.  19  strictly  increases  the  likelihood  with  respect  to  the  mixing  proportions,  at 
least  locally. 

3.2  Factor  Means 

Similar  to  the  mixing  proportions,  the  update  equations  for  the  factor  means  sim¬ 
plify.  The  score  function  with  respect  to  fik  equates  to 

M  N 

X)  =  ^  wmikY^{xmi  -  (29) 

771=1  1=1 

for  k  =  1, . . . ,  K.  From  Eq.  B-7  in  Appendix  B,  the  term  .//(/  for  the  factor  mean 
from  the  FIM  is  given  by 


M 


•4'  t:  —  N  amkXk  . 


(30) 


771=1 


Inserting  Eqs.  29  and  30  in  Eq.  20,  the  iteration  for  the  kth  factor’s  mean  reduces  to 

Ym  Yn  w w  (x  •  -  uw) 

Am=l  Ai=l  H'k  ) 


(t+1)  (t)  . 

Vk  =fH  + 


N  Eti 


(31) 


a 


mk 


Provided  the  uniquenesses  are  positive,  the  matrix  .//t/  is  guaranteed  to  be  positive 
definite  by  the  definition  of  the  covariance  in  Eq.  3  when  evaluated  at  6  =  6{t\ 
Therefore,  the  iterations  in  Eqs.  20  and  31  strictly  increase  the  likelihood  with  re¬ 
spect  to  each  factor’s  mean. 

3.3  Factor  Loading  Matrix  and  Uniqueness 

In  this  section,  we  focus  on  detailing  the  scoring  function  and  the  constrained  FIM 
in  Eq.  21  with  respect  to  the  covariance  parameters  (A k,ipk)-  The  score  function 
with  respect  to  the  parameters  (\k,  Wk)  is  found  through  a  change  of  variables  from 
the  score  function  with  respect  to  crk.  It  is  straightforward  to  show  that  the  score 
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function  with  respect  to  crk  is  given  by 


1  M  N 

*>  =  ;££  WmifcVeC  ( Xmi  Hk){xmi  g*k)  ) 


ra=l  2=1 


(32) 


We  briefly  drop  the  dependence  on  the  specific  factor  k  to  simplify  the  notation  and 
note  that  the  following  results  are  identical  for  each  k  —  1, . . . ,  K.  For  the  change 
of  variables,  we  have 


Go-  —  [G\,g^\ 


dcr  dcr 

d\T  ’  dip 


The  partial  derivatives  in  Eq.  33  evaluate  to 


(33) 


Gx  =  (A  <8>  IP)  +  (Ip  ®  A)  E,  (34) 

9p  =  [(e?  <g>  e\) ,  (ep2  <g>  ep2 ) , . . . ,  (ep  <g>  epp)]  1,  (35) 


where  ( A  ®  B)  represents  the  Kronecker  product  between  matrices  A  and  B  and 
e“  is  a  (a  x  1)  unit  vector  such  that  all  the  entries  are  0  except  the  ith  entry  is  1.  The 
matrix  E  €  R?)rxpr  is  defined  by 

<Tvec  (AT) 

E  =  a - 777T  =  [Op  ®  el) ,  (Ip  ®  el) , . . . ,  (Ip  ®  e;>] ,  (36) 

ovec  (A) 

Matrix  E  is  a  permutation  matrix  that  satisfies  ET E  =  77f7T  =  lpr.  Vector  gv  e 
Kp2xl  simply  sums  the  diagonal  elements  of  a  vectorized  (p  x  p)  matrix. 


From  Eqs.  32  and  33,  it  is  straightforward  to  show  that  the  score  function  with 
respect  to  the  parameters  (A,  ip)  is  given  by 


GlVJ($:X) 


’  2(AT®IP)  Vct£($;X)  ' 

'  vA mx) ' 

glVJ^-X) 

_  V/(0;X)  _ 

(37) 


We  constrain  each  factor  loading  matrix  to  be  lower  triangular.  This  constraint  per¬ 
mits  local  identifiability  of  the  factor  loading  matrices.13  The  lower-triangular  con¬ 
straint  is  also  convenient  since  it  is  a  linear  transformation  of  the  loading  matrix  in 
vector  form.  The  reduced  parameter  vector  of  the  factor  loading  matrix  is  defined 
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by 


A 


A!,  A 


2  ) 


(38) 


where  A*  =  [Al  i?  Alji+1, . . . ,  AliP]T  is  the  Th  column  vector  from  A  not  including 
the  elements  above  the  diagonal.  The  constrained  factor  loading  may  be  expressed 
according  to 


A  =  Ux  A,  (39) 

where  U\  G  ]gPrxPr~r(r-1'i/2  simply  expands  A  to  A  by  inserting  zeros  at  the  appro¬ 
priate  entries.  Specifically,  U\  is  given  by 

Ux  =  [K  ®Xi) ,  (er2  ®X2) , . . . ,  «  ®Xr)] ,  (40) 


where 


X,; 


0(i—  l)xp—  (i— 1) 
Ip-(i-l) 


(41) 


The  term  0ax6  represents  a  (ax  6)  matrix  of  zeros.  It  is  clear  that  UjUx  =  lpr-r(r-i)/2 
so  that  Ux\  =  A. 


We  do  not  constrain  the  uniquenesses  to  be  positive  within  the  gradient  steps.  In¬ 
stead,  each  uniqueness  is  projected  onto  the  interval  [e,  oo)  for  a  small  e  >  0 

after  each  iteration.  As  a  result,  the  constraint  matrix  Ua  for  the  gradient  update  in 
Eq.  21  is  given  by 


U„ 


Ux  0 
0T  1 


(42) 


In  the  reduced  parameter  space,  we  have 


UlGlVJ^-X) 


v-xmx) 

V/(0;  X) 


(43) 
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The  constrained  FIM  with  respect  to  a  factor’s  covariance  parameters  is  given  by 


Ul<?„J„G„V„  =  -  Y,  «mATA, 


M 


m=  1 


(44) 


where 


A  =  (S"1/2  ®  £"1/2)  GaUa.  (45) 

Since  (E_1  (8)  S_1)  is  positive  definite,  a  (positive)  matrix  square  root  is  defined. 

Provided  Eq.  44  is  positive  definite,  the  iteration  in  Eq.  21  will  strictly  increase  the 
data  likelihood  with  respect  to  the  covariance  parameters.  Matrix  ATA  is  clearly 
either  positive  definite  or  positive  se/mdefinite.  Either  way,  the  iteration  in  Eq.  21 
will  not  decrease  the  likelihood.  Matrix  ATA  is  positive  definite  if  A  has  full  column 
rank.  Defining  (E-1/2  <8>  E”1/2)  as  the  positive  root  so  that  it  is  positive  definite,  we 
have  the  equivalent  condition  that  ATA  is  positive  definite  if  GaUa  has  full  column 
rank. 


Inserting  Eqs.  44  and  43  in  the  iteration  in  Eq.  21,  the  iteration  for  the  reduced 
parameters  is  given  by 


1 

+ 

1 _ 

1 

S  jg 

1 _ 

1 

+ 

1 

1 _ 

+ 


N  TM  a{t\ 

Z—/m=l  mk 


v^(e-.x) 

V/(0;X) 


e=e(t'> 

(46) 


3.4  Convergence  and  Stopping  Condition 

We  briefly  discuss  algorithm  convergence  and  provide  a  stopping  condition  for  the 
iterative  algorithm.  We  set  Pg  =  U0  (U0J0Ug)  1  Ug.  For  a  sufficient  number  of 
iterations,  the  scoring  method  converges  according  to9 


Q(t+1)  _ 


<  Pg 


e(t)  -  e* 


(47) 


if  pg  <  1  for  fixed  point  0*,  where  pg  —  \l  +  Pg*Hg*\.  The  term  II o  is  the  Hessian 
matrix  of  the  data  log-likelihood  defined  by 


He  =  VeV^(0;X). 


(48) 
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The  rate  pg  depends  on  the  form  of  matrix  Pg*  and  how  well  it  conditions  He*. 
Using  the  triangle  inequality  and  repeated  application  of  Eq.  47,  the  change  between 
consecutive  iterates  is  upper-bounded,  for  large  t,  according  to 

0(m)  _  0(t)  <  pt^Cgi  (49) 

where  Cg  —  (1  +  pg)  6(<>>  —  6*  >0.  If  pg  <  1,  this  change-error  also  predictably 
contracts.  Unlike  Eq.  47,  the  left-hand  side  of  Eq.  49  does  not  require  knowledge  of 
6*.  Therefore,  we  rely  on  the  change-error  to  exit  the  gradient  steps.  For  a  desired 
tolerance  eg,  the  algorithm  exits  when  the  condition  0ln  —  6 <  eg  is  met. 

3.5  Algorithm  Summary 

Figure  1  summarizes  the  constrained  Fisher  scoring  algorithm  for  an  MFA. 
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>  eri 


Input:  Each  sensor  m  measures  xmi  for  i  —  1, . . . ,  N 
Output:  Initialization:  t  =  0,  6{(]\  0  ~ 1 }  s.t.  A  0)  —  0{  1 
while  |0(t)  —  0(£-1}|  >  eg  do 

Update  parameters  via  constrained  Fisher  scoring 
for  k  —  1, ...  ,K,m  —  1, ...  ,M,i  —  1, ...  ,N  do 

Calculate  posterior  probability  of  xmi  belonging  to  factor  k  given  0 1  ( ' 


,(*) 


Wj;u  = 


(t) 
Ot 

—  _ mfc 


E 


Jmik 

end  for 

for  k  —  1, . . . ,  A'  do 

Update  kth  factor’s  parameters 


Mi ‘+1>  =  Mi‘>  +  & 


„(t) 


k(xmi  fJjJ) 


(0 


r  (i+1) 

Ak 

1 

1 _ 

1  2 

'  vkmx)  i 

.  4‘+1)  . 

1 

_ 1 

'  NTm  i  a® 

iV  Z^m=l  Cmfc 

.  v/(e;x)  J 

.  (*+!) 


if  ^t+1) 

v4m) 


=  f/AA[i+1) 

<  0  then 


e= 


=  e 


end  if 
end  for 

for  m  —  I  *  , . . ,  M  do 

Update  mixing  probabilities  for  sensor  m 


OK)  _  j_  sr^JV 
LXm  ~  AT  Z^  i= 1 

end  for 
t<-t  +  1 

end  while 
return  0(t+1'> 


w. 


(t) 

mil’  ' 


(t)  ]T 
>  WrniK\ 


Fig.  1  Constrained  Fisher  scoring  for  an  MFA 
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3.6  Relationship  with  Expectation-Maximization 


The  constrained  scoring  iteration  for  the  mixing  proportions  is  identical  to  that  for 
the  EM  algorithm.3  Furthermore,  the  iteration  for  the  factor  means  are  equivalent 
to  the  EM  algorithm  in  an  asymptotic  sense.  If  in  Eq.  31  we  substitute  J^m= 1  amt^ 

for  Em=1  aml  and  note  that 


M  M  N 

(so) 

m=  1  m=  1  i= 1 

the  iterates  for  each  k  =  1,2,...,  K  factor  mean  becomes 

(t+l)  _  (t)  ,  Lm=l  Et=l  Wmik(X'm-i  —  ) 

Z_/m=l  Z_ /i=l  ^rnik 
w(t)  X  ■ 

_  Z-^dm—  1  Z-^j=l  /r  i  \ 

Ym  Yn  w(t) 

Z-^m= 1  Z-/i=l  mik 

This  iteration  for  the  factor  mean  is  exactly  the  EM-based  iteration.3 

Since  am+1\  for  each  m  =  1,2 , ,M,  converges  to  a  solution,  the  sequence 
{cxlr?,  cxm\  •  •  • ,  otm  }  is  a  Cauchy  sequence.  Thus,  the  difference  ai!k  ' j  —  a ™  can 
be  made  arbitrarily  small  for  a  sufficiently  large  t.  Hence,  in  an  asymptotic  sense, 
the  factor  mean  updates  in  Eqs.  31  and  51  are  equivalent. 

In  terms  of  each  factor’s  covariance  parameters,  the  constrained  scoring  and  EM 
methods  differ.  The  EM-based  iterates  for  the  factor  uniqueness  and  loading  matrix 
are  decoupled,3  while  the  iterates  from  the  constrained  scoring  method  are  coupled. 
This  difference  may  also  manifest  in  the  difference  in  convergence  rates  seen  in  the 
simulation  examples  in  the  following  section. 

4.  Simulation  Examples 

We  present  2  simulation  examples  that  illustrate  the  effectiveness  of  the  proposed 
centralized  scoring  method.  We  first  consider  a  synthetic  MFA  example  under  dif¬ 
ferent  observation  noise  conditions  to  compare  algorithm  convergence  and  com¬ 
putation.  We  then  consider  a  multi-aspect  observation  model  that  demonstrates  the 
benefits  of  information  sharing  in  global  model  learning. 
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4.1  Synthetic  MFA  Example 


We  first  compare  the  performance  of  the  proposed  algorithms  using  the  generative 
MFA  model.  Parameter  estimates  are  generated  using  the  centralized  EM  algorithm 
in  Whipps  et  al., 17  a  constrained  Newton’s  method,  and  the  proposed  constrained 
Fisher  scoring  method.  The  constrained  Newton’s  method  implements  the  gradient 
step  according  to 

(,«+!>  =  0«+i>  +  ,  (52) 

where  Hq  is  the  Hessian  matrix  of  the  data  log-likelihood  defined  by 

He  =  VeVTet{6-,X).  (53) 


In  general  terms,  the  EM  algorithm  for  the  MFA  model  is  simple  computation- wise, 
but  tends  to  converge  slowly.  Newton’s  method  typically  converges  faster  than  first- 
order  methods,  but  at  a  higher  computational  cost  per  iteration.  We  demonstrate 
that  the  constrained  Fisher  scoring  has  EM-like  computational  requirements  with 
improved  convergence  properties. 


We  consider  a  simple  MFA  model.  The  sensor  observations  are  3  dimensional  (p  = 
3)  with  a  2-D  latent  space  (r  =  2)  and  2  mixed  factors  ( J  =  2).  The  latent  object 
is  composed  of  2  planar  segments  in  the  shape  of  a  “L”  that  intersect  along  the 
line  between  points  (—1,0,0)  and  (1,0,0).  Specifically,  the  parameters  of  the  2- 
component  mixture  are  given  by 


'  0  ' 

'  0  ' 

AE  = 

0 

,H2  = 

1 

1 

0 

1  0 

1  0 

Ai  = 

0  0 

,  a2  — 

0  1 

_  0  1  _ 

- 1 

o 

o 

01  =  02  =  0, 

otm  =  [0.75  -  m0.05,  0.25  +  m0.05]T  , 


(54) 


(55) 

(56) 

(57) 


for  m  =  1,  2, . . . ,  9.  A  similar  simulation  example  was  used  by  Baek  and  McLach- 
lan18  to  measure  performance  for  the  single-sensor  case. 
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For  these  simulations,  there  are  M  =  9  sensor  nodes.  For  each  of  the  M  —  9 
sensor  nodes,  N  =  50  samples  are  generated  using  the  generative  MFA  model.  The 
algorithms  do  not  attempt  to  account  for  the  relationship  A/,  =  fik  of  the  simulated 
mixture. 

The  algorithms  are  initialized  by  the  true  values  perturbed  by  small,  independent 
errors.  The  algorithms  are  initialized  with  O"'1  =  V(B  +  u  ),  where  V(0)  projects 
6  onto  the  feasible  set  (i.e.,  mixing  proportions  are  proper  probabilities,  the  unique¬ 
nesses  are  positive,  and  the  factor  loading  matrices  are  lower- triangular),  v  is  a  re¬ 
alization  of  spherical  noise  distributed  according  to  v  A/"(0,  cr2I)  with  a  =  10  2. 
We  note  that  iterative  algorithms  such  as  EM  and  gradient  ascent  are  sensitive  to 
their  initialization,  and  ML  problems  typically  have  multiple  stationary  points.  The 
proposed  algorithm  is  no  different  in  this  regard.  The  /c- means  algorithm  can  be 
viewed  as  a  simplification  of  EM19  and  can  provide  a  fast  initialization  for  all  3 
algorithms.  However,  we  do  not  explore  initialization  methods  here.  To  make  fair 
comparisons,  the  initial  values  are  close  to  the  true  values  and  are  the  same  for  each 
algorithm. 

Figures  2-4  are  results  from  a  single  set  of  NM  =  450  samples  with  each  factor’s 
uniqueness  at  =  1/32.  The  sample  points  are  plotted  in  Fig.  2  as  markers  (each 
marker  type  corresponds  to  a  sensor  node)  along  with  the  2  true  latent  line  segments. 
Figure  2  is  a  2-D  perspective  of  the  3-D  sensor  data.  As  seen  in  Fig.  2,  the  data 
points  are  concentrated  about  their  factors  save  a  few  points  near  the  vertex. 


Fig.  2  3-D  and  2-D  views  of  simulated  sensor  data  of  the  simulated  MFA  with  uniqueness 

i/j  =  1/32  from  M  =  9  sensor  nodes. 
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We  first  compare  the  performance  of  the  proposed  scoring  algorithm  with  (con¬ 
strained)  Newton’s  method  and  the  EM  algorithm.  The  iterative  algorithms  are 
stopped  after  the  error  \0in  l'  —  6{t]\  falls  below  eg  =  1 0 ~ 1 2 . 

Figure  3  demonstrates  algorithm  convergence  by  showing  the  error  \0lt+l>  —  01 
versus  gradient  iterations  from  the  methods  of  constrained  Newton  (+),  EM  (A), 
and  constrained  Fisher  scoring  (o).  As  seen  in  Fig.  3,  the  change-error  from  both 
Newton’s  method  and  Fisher  scoring  decreases  much  faster  than  EM,  with  Newton’s 
method  having  the  steepest  slope. 


Fig.  3  Change-error  vs.  gradient  iterations  from  constrained  Newton  (+),  EM  (A),  and  con¬ 
strained  Fisher  (o)  with  uniqueness  i/j  =  1/32 


Figure  4  plots  the  execution  times  in  milliseconds  reported  by  MATFAB  for  each 
gradient  step  of  each  algorithm  and  illustrates  the  relative  computation  times  for 
the  algorithms.  As  seen  in  Fig.  4,  Fisher  scoring  and  EM  execute  a  gradient  step  in 
nearly  equal  time  and  approximately  10  times  faster  than  Newton’s  method. 


outer  (gradient)  iterations 


Fig.  4  Execution  times  per  gradient  step  of  constrained  Newton  (+),  EM  (A),  and  constrained 
Fisher  (o)  with  uniqueness  ijj  =  1/32 
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Table  1  provides  total  algorithm  execution  times  reported  by  MATLAB  to  reach  the 
error  tolerance  eg  =  10-12,  sample  averaged  over  100  Monte  Carlo  trials;  shown  in 
parenthesis  are  the  execution  time  standard  deviations.  The  2  columns  of  execution 
times  in  Table  1  correspond  to  uniquenesses  of  0  =  1/32  and  ip  =  1/8,  respec¬ 
tively.  As  seen  in  Table  1,  Fisher  scoring  reaches  the  error  tolerance  in  the  shortest 
amount  of  execution  time  of  all  the  methods. 


Table  1  Average  total  execution  times  in  milliseconds  to  reach  a  tolerance  of  eg  =  10-12.  The 
total  times  are  sample-averaged  over  100  Monte  Carlo  trials  with  the  sample  deviation  listed 
in  parenthesis. 


Method 

ip  -  1/32 

0  =  1/8 

Newton 

EM 

Fisher 

1288 (1404) 
992(1134) 
58  (10) 

1551  (1526) 
310(85) 
130  (26) 

Figures  5-7  and  Table  1  show  convergence  and  computation  results  for  the  previ¬ 
ous  example,  but  here  we  quadruple  the  factor  uniqueness  to  0  =  1/8.  As  seen  in 
Fig.  5,  it  is  more  difficult  to  visually  assign  many  of  the  samples  to  a  specific  factor 
with  0  =  1/8.  Thus,  we  might  expect  longer  convergence  times  in  this  case.  This 
is  illustrated  by  comparing  the  convergence  of  Fisher  scoring  between  Figs.  3  and 
6,  where  we  see  somewhat  longer  convergence  times  with  0  =  1/8.  Curiously,  the 
convergence  rate  of  EM  appears  improved,  but  remains  slower  than  Fisher  scor¬ 
ing.  The  total  execution  time  of  Fisher  scoring  remains  shorter  than  both  Newton’s 
method  and  EM,  as  seen  in  the  last  column  of  Table  1  with  0  =  1/8.  Fisher  scor¬ 
ing  outperforms  EM  in  terms  of  convergence  rate  and  Newton’s  method  in  terms  of 
computations  per  iteration. 

Comparing  Figures  4  and  7,  the  execution  times  per  iteration  of  each  algorithm 
between  ip  =  1/32  and  '0  =  1/8  are  essentially  the  same,  as  expected. 
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Fig.  5  3-D  and  2-D  views  of  simulated  sensor  data  of  the  simulated  MFA  with  variance  ip  =  1/8 
from  M  =  9  sensor  nodes 


Fig.  6  Change-error  vs.  gradient  iterations  from  constrained  Newton  (+),  EM  (A),  and  con¬ 
strained  Fisher  (o)  with  uniqueness  ip  =  1/8 


Fig.  7  Execution  times  per  gradient  step  of  constrained  Newton  (+),  EM  (A),  and  constrained 
Fisher  (o)  with  uniqueness  ip  =  1/8 
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4.2  Manifold  Learning  Example 


In  this  example,  we  demonstrate  that  the  scoring  method  benefits  from  information 
sharing  and  integration  in  model  learning.  In  the  previous  examples,  every  node  ob¬ 
served,  with  differing  proportions,  every  factor  of  the  mixture.  In  this  example,  the 
observations  depend  on  the  perspective  of  the  sensors  relative  to  the  target  object. 
Additionally,  the  underlying  structure  is  not  a  finite  mixture,  but  instead  a  smooth 
manifold  in  the  form  a  decaying  spiral.  Specifically,  the  decaying  spiral  is  given  by 

y  =  [(13  —  0.5a;)  cos  a;,  (0.5  —  13)  sin  a;,  a;]T,  (58) 

where  u  G  [0,  47t).  Each  observation  of  the  spiral  is  corrupted  by  additive  white 
Gaussian  noise  with  unit  variance.  This  model  was  used  in  previous  works  to  demon¬ 
strate  the  efficacy  of  learning  an  MFA  model  as  a  surrogate  for  a  nonlinear  mani¬ 
fold.20’21 

The  model  is  set  up  to  illustrate  a  case  in  which  each  node  observes  a  fraction  of 
the  entire  manifold.  Figures  8  and  9  depict  this  case.  Figure  8  shows  the  samples 
observed  by  Node  5,  and  Figure  9  is  an  overlay  of  the  observations  from  all  9 
sensors. 


Fig.  8  A  realization  of  noisy  samples  of  the  decaying  spiral  observed  by  Node  5 


We  discretize  the  angles  of  the  spiral  so  that  each  sensor  observes  the  spiral  accord¬ 
ing  to  a  discrete  probability  distribution.  The  angle  distributions  as  seen  by  Node  1, 
5,  and  9  are  shown  in  Fig.  10.  As  seen  in  the  figure,  Node  1  and  Node  9  will  observe 
overlapping  segments  of  the  spiral  with  higher  probability  than  Node  5. 
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Fig.  9  Samples  observed  by  all  sensor  nodes.  Each  marker  relates  samples  to  a  sensor  node 


Fig.  10  Distributions  of  angles  of  a  simulated  decaying  spiral  viewed  by  sensor  nodes.  The 
modes  are  centered  at  the  sensor’s  angle  relative  to  the  spiral,  which  is  defined  over  [0,  Air),  in 
the  spiral’s  xy -plane 


For  the  given  set  of  samples,  we  compare  the  results  of  executing  the  scoring  al¬ 
gorithm  at  Node  5  with  only  its  data  against  the  centralized  scoring  method  having 
access  to  observations  from  all  the  nodes.  The  models  are  learned  with  a  change- 
error  tolerance  of  eg  =  10-6.  Figures  11  and  12,  respectively,  display  the  locally 
and  globally  learned  tangent  vectors  of  the  decaying  spiral.  By  itself,  Node  5  leams 
only  a  fraction  of  the  overall  object.  When  federated,  a  central  node  is  able  to  learn 
a  more  rich  appearance  model  of  the  object  despite  limited  views  of  the  object  at 
each  sensor  node. 
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Fig.  11  Estimates  of  the  MFA  parameters  at  Node  5  via  Fisher  scoring 


Fig.  12  Estimates  of  the  MFA  parameters  at  a  central  node  via  Fisher  scoring 


5.  Conclusions 

We  derived  a  constrained  Fisher  scoring  algorithm  that  exploits  block-diagonal  and 
low-rank  structures  of  the  expected  FIM  of  the  complete  data;  this  results  in  sig¬ 
nificantly  faster  computation  per  iteration  compared  to  Newton’s  method.  We  ob¬ 
served  the  scoring  algorithm  converging  faster  than  the  EM  algorithm  at  similar 
per-iteration  computation,  resulting  in  speedup  factors  of  2-10  for  the  examples 
considered.  Finally,  we  demonstrated  the  efficacy  of  the  centralized  learning  ap¬ 
proach  for  efficiently  federating  low-rank  MFA  models  across  an  entire  sensor  net¬ 
work  to  provide  a  global  appearance  model,  even  when  each  sensor  has  a  limited 
view  of  the  object. 
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Appendix  A.  Constrained  Cramer-Rao  Bound 
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Stoica  and  Ng1  developed  an  expression  for  the  constrained  Cramer- Rao  Bound 
(CRB)  for  parametric  estimation  that  does  not  require  the  Fisher  information  matrix 
(FIM)  of  the  unconstrained  problem  to  be  of  full  rank.  The  constrained  CRB  is  given 
by 


E  ((0  -  0)(0  -  0)T)  >  f/(f/TJf/)-1f/T,  (A-l) 

where  6  is  an  unbiased  estimate  of  0  6  R"  and  J  is  the  FIM  of  the  unconstrained 
problem.  The  matrix  U  is  an  orthonormal  basis  for  the  null  space  of  matrix  F  de¬ 
fined  by 

F  =  G  Rkxn  (A-2) 

d0T 

where  f  G  Rfc  is  a  column  vector  of  constraints  on  6  such  that  f{0)  =  0.  It  is 
assumed  that  f  satisfies  all  the  conditions  in  Stoica  and  Ng1,  such  as  the  number  of 
constrains  in  f  are  fewer  than  the  number  of  parameters  in  6  (i.e.,  k  <  n),  and  the 
set  of  parameters  that  satisfies  the  constraints  is  nonempty.  However,  as  we  show 
next,  the  basis  U  need  not  be  orthonormal. 

Let  the  columns  of  U  G  unxn~k  form  a  basis  for  the  null  space  of  F.  The  matrix 
defined  by 


Pu  =  t/([/T[/)_1[/T  (A- 3) 

projects  onto  the  column  space  of  U.  If  U  is  orthonormal,  then  Pu  =  UUT  as  in 
Stoica  and  Ng1.  Define  U  =  U(UTU)~1,  and  W  G  Rnxn  is  an  arbitrary  matrix.  Let 
UTJU  be  positive  definite  with  spectral  decomposition  given  by  UTJU  =  QDQT 
where  Q  is  an  orthogonal  matrix  and  D  is  diagonal  with  positive  entries  along  its 
diagonal.  From  Eq.  9  in  Stoica  and  Ng1,  we  have 

E  ((0  -  0)(0  -  0)T)  >  WPV  +  PuWT  -  WPuJPuWT  (A-4) 

=  WUUT  +  U(WU)T  -  (WUQ)D(WUQY 
=  UQD~1QtUt  -  ( WUQ  -  UQD~1)D(WUQ  -  UQD “1)T. 

(A- 5) 

'Stoica  P,  Ng  BC.  On  the  Cramer-Rao  bound  under  parametric  constraints.  IEEE  Signal  Pro¬ 
cessing  Letters.  1998:5(7):  177-179. 
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The  W  that  maximizes  the  right-hand  side  of  Eq.  A-4  satisfies 


WU  =  UQD~1Qt.  (A-6) 

By  substituting  Eq.  A-6  into  Eq.  A-5,  we  arrive  at  the  lower  bound  in  Eq.  A-l 
without  restricting  U  to  be  orthonormal. 
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Appendix  B.  Complete  Data  Fisher  Information  Matrix  for  an  MFA 
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In  this  section,  we  derive  the  Fisher  information  matrix  (FIM)  for  a  complete-data 
model  of  a  mixture  of  factor  analyzers  (MFA).  We  first  derive  the  complete  data 
FIM  for  a  Gaussian  mixture  model  (GMM)  and  then  through  a  change  of  variables 
arrive  at  the  FIM  for  a  MFA. 

The  complete  data  log-likelihood  of  the  GMM  is  given  by 

M  N 

£c($;  X,Z)  =  1oS  amZrniJ\f(xmi;  S2mi).  (B-l) 

m= 1  i= 1 

The  complete  data  FIM  of  the  GMM  is  defined  by 

=  E  (V*fc($;  X,  Z)V^C($;  X,  Z))  .  (B-2) 

The  gradient  of  t  with  respect  to  $  =  [fij,  crj, . . . ,  nTK,  cr aj, . . . ,  a^]T  is 
given  by 

8(zmi  logAf(x  mii 

3(%mi  log  TV* (&rni)  Ml?  ^l) 

S(zmi  -  K)V„k  log  JV(  'Emii  Mio  ^k) 

&(Zmi  -^0  V crK  log  J\f  (xrrii)  ,  S ft ) 

-  1  )S(m  -  l)^1 

S(zmi  -  K)6(m  -  1  )oy^ 

5(zmi  -  1  )S(m  -  M)a^1 

5{zmi  -  K)8(m  -  M)a^K 

where  8 (a  —  b)  is  1  when  a  —  b  and  0  otherwise.  It  is  straightforward  to  show  that 
the  gradients  in  Eq.  B-3  are  given  by 

log  J\f(xmi:  fij,,  Sfc)  ( xmi  Mfc);  (B-4) 

^ <rk  log  J\f{xmi]  Hk,  Sfc)  =  -vec  ( xmi  )  , 

(B-5) 
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for  k  —  1, . . , ,  K.  It  is  clear  from  Eq.  B-3  that  the  complete  data  FIM  of  the  GMM 
is  block  diagonal,  and  can  therefore  be  expressed  as 


J*  =  dia§  (JM,  ,  Jai ,  •  •  •  ,  J i±K ,  JcrK  i  Jcti  j  •  •  •  j  Jolki)  i 


(B-6) 


where 

M  N 

'y  ^  y  ^  E  (5{zmi  —  k)Tjf,  ( Xmi  —  ^k)ixmi  ~  Mfc)  ) 

m=  1  i= 1 
M 

Nj2^mk  Sfc1,  (B-7) 

m=l 
N 

E  (amLe^eL)  =  Ar  dia§  w1  >  (B-g) 

2=1 

for  A;  =  1, ,  Jl  and  m  =  1, . . . ,  M.  For  the  term  Jak  we  use  the  identity 

E  (V$r($;  X,  Z)V^C($;  X,  Z))  =  E  (-V#V£r($;  X,  Z))  .  (B-9) 

Subsequently,  we  have  for  k  =  1, . . . ,  K 

M  N 

=  E  E  E  (S(zmi  -  k)VakVlk  log  J\f(xmi\  fikl  Sfc)) 

m= 1  2= 1 
M  N  / 

=  ^  ^  ^  ^  E  f  —  k)—  (Sfc  8)  Sfc  (a;mi  —  fik)(xmi  —  /j,k)  Efc  ) 

m= 1  2=1  A 

M  TV 

+ee= 

ra=l  z=l 
M  N 

-ee* 

m=l  2=1 

x  M 

=  j  Yl amk  (sfc 1 0  ^ 1)  •  (B'10) 

m=  1 

We  have  established  the  block-diagonal  form  of  the  complete  data  FIM  of  the 
GMM.  Now  we  relate  the  FIM  of  the  MFA  to  the  FIM  of  the  GMM  through  a 
change  of  variables.  The  factor  means  and  mixing  proportions  remain  unchanged 
between  the  GMM  and  MFA.  The  change  is  limited  to  the  parameterization  of  each 
factor’s  covariance  through  the  structure  given  in  Eq.  3.  The  Jacobian  matrix  for  the 
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transformation  is  given  by 


Ge  = 


<9$ 


Ip  0 

0  G(y1 


Ip  0 

0  Ga-K 

I MK 


(B  -11) 


where  Gak  e  RP2xP(r+1)  is  defined  in  Eq.  33.  The  complete  data  FIM  of  the  MFA 
is  then  given  by 

Je  =  diag  . 

(B-12) 


Approved  for  public  release;  distribution  is  unlimited. 


33 


List  of  Symbols,  Abbreviations,  and  Acronyms 


diag  (AX,A 


Terms: 

EM  expectation-maximization 
FIM  Fisher  information  matrix 
GMM  Gaussian  mixture  model 
MFA  mixture  of  factor  analyzers 
MF  maximum  likelihood 
Mathematical  symbols: 

amk  mixing  proportion  of  the  kth  factor  observed  by  the  mth  sensor  node 
A/,  loading  matrix  of  the  klh  factor 
[if.  mean  of  the  kth  factor 

$  vector  of  unknown  parameters  of  the  GMM  model 
ifjk  uniqueness  of  the  kth  factor 
E/,.  covariance  of  the  kth  factor 
6  vector  of  unknown  parameters  of  the  MFA  model 
xrni  ?'lh  observation  from  the  mth  sensor  node 
Mathematical  operators: 

, . . . ,  An)  a  diagonal  matrix  with  Ax,  A2, . . . ,  An  along  the  diagonal  where  each  A* 
may  be  a  square  matrix  with  differing  sizes  across  i  —  1,  2, . . . ,  n 

E()  the  expected  value  of  a  random  quantity 

max( )  the  maximum  of  an  otherwise  variable  quantity 

V gf  (6)  the  gradient  of  function  /  with  respect  to  6 

{A®  B)  the  Kronecker  product  of  matrix  A  and  B 

vec  (A)  the  vectorization  of  matrix  A.  If  A  =  [ai,  a2, . . . ,  an],  then  vec  (A)  = 
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